Supervised Learning Classification Project: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Submitted by Larry Weatherford on 7-05-2023¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary¶

  • id: Customer ID
  • age: Customer’s age in completed years
  • experience: #years of professional experience
  • income: Annual income of the customer (in thousand dollars)
  • zipcode: Home Address ZIP code.
  • family: The family size of the customer
  • ccavg: Average spending on credit cards per month (in thousand dollars)
  • education: education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • mortgage: Value of house mortgage if any. (in thousand dollars)
  • personal_loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • securities_account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • cd_account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • creditcard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Key Questions needing answers:¶

  1. What are the key attributes of a customer that make them more likely to accept a personal loan? Are certain features like income, education, age, or family size more predictive than others?

  2. Are customers who use internet banking facilities more likely to accept personal loans?

  3. How does the amount of average spending on credit cards per month correlate with the likelihood of accepting a personal loan?

  4. Is there a relationship between having a securities account or certificate of deposit (CD) account with the bank and the propensity to accept a personal loan?

  5. How does the level of a customer's education influence their likelihood to accept a personal loan?

  6. Is there any relationship between a customer's age or years of professional experience and their inclination to accept a personal loan?

  7. Are customers with a mortgage more likely to accept a personal loan?

  8. Does the usage of a credit card issued by any other bank (excluding AllLife Bank) have any impact on a customer's decision to accept a personal loan?

  9. How does the customer's annual income influence the likelihood of accepting a personal loan?

  10. Are there specific segments of customers who are more likely to accept personal loans?

Importing necessary libraries¶

In [ ]:
!pip install -q black
from black import WriteBack, Report, Mode, reformat_code, TargetVersion, reformat_one#

# Importing all necessary libraries for this project
import re
import pandas as pd
import numpy as np
import contextlib
import io
import warnings
warnings.filterwarnings("ignore")
from pathlib import Path

import sklearn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
import graphviz

from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree

# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)

#Install Zipcode library and set search variable
!pip install -q uszipcode
from uszipcode import SearchEngine
search = SearchEngine()

Loading the dataset¶

In [ ]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')
In [ ]:
data = pd.read_csv("/content/drive/MyDrive/Projects/PersonalLoanCampaign/Loan_Modelling.csv")
loans = data

Data Overview¶

  • Observations
  • Sanity checks
In [ ]:
loans.shape
Out[ ]:
(5000, 14)
In [ ]:
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Observations: 13 of the columns are integers and 1 column (ccavg) is float. No missing values seen. There are 5000 entries with no NULL values.

In [ ]:
loans.duplicated().sum()
Out[ ]:
0

No duplicate values in the data

In [ ]:
loans.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

Observations: ID is just a record ID and add no value. We will drop the ID column.

In [ ]:
# Drop the ID column since it is not relevant to us
loans = loans.drop(['ID'], axis=1)
In [ ]:
# Show the count of unique values for each column
unique_values_count = loans.nunique()
print(unique_values_count)
print("Total unique values: ",loans.nunique().sum())
Age                    45
Experience             47
Income                162
ZIPCode               467
Family                  4
CCAvg                 108
Education               3
Mortgage              347
Personal_Loan           2
Securities_Account      2
CD_Account              2
Online                  2
CreditCard              2
dtype: int64
Total unique values:  1193

List showing the count of unique values of each feature

In [ ]:
loans.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.0 0.0 0.0 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0

Initial Observations from 'describe' results:

  1. Age and Experience: The average age of the customers is approximately 45 years,and the average professional experience is about 20 years. The experience has a negative minimum value, which seems incorrect as years of experience cannot be negative. This could be a data entry error and we will need to examine it more closely.

  2. Income: The average annual income of the customers is about 73.7K, with a large standard deviation of 46K. This suggests significant income disparity among the customers.

  3. ZIPCode: This column is a categorical variable, even though it might contain numerical values. Therefore, using mean, median, or other such statistics would not be meaningfu. we will need to explore the correct way to categorize and handle zipcodes. There are 467 unique zipcodes.

  4. Family: The average family size is approximately 2.4, which indicates that most customers likely have a small family (between 1 to 3 family members).

  5. CCAvg: The average monthly credit card spend is about 1.9K. The maximum monthly spend is 10K, indicating some customers are high spenders.

  6. Eeducation: The median education level is 2 (Graduate), and the mean is approximately 1.88, suggesting that a majority of the customers have at least a Graduate level education.

  7. Mortgage: The average value of house mortgage is approximately 56.5K, but the standard deviation is quite large at 101.7K, indicating a wide range in the values of mortgages. Also, the 25th, 50th percentile is 0, indicating that many customers do not have a mortgage.

  8. Personal Loan: Only about 9.6% of customers accepted a personal loan in the last campaign. This shows that the target variable is highly imbalanced, which needs to be taken into account during model building.

  9. Securities Account, CD Account, Online, Credit Card: These are binary (0 or 1) features. About 10.44% of customers have a securities account with the bank, 6.04% have a CD account, 59.68% use internet banking facilities, and 29.4% use a credit card issued by a different bank.

Actions to take:

  • Convert column names to lowercase for consistency and to make it simpler to code.

  • Group ZIPCode into meaningful regions.

In [ ]:
# Convert column names to lowercase
loans.columns = [i.lower() for i in loans.columns]
In [ ]:
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   age                 5000 non-null   int64  
 1   experience          5000 non-null   int64  
 2   income              5000 non-null   int64  
 3   zipcode             5000 non-null   int64  
 4   family              5000 non-null   int64  
 5   ccavg               5000 non-null   float64
 6   education           5000 non-null   int64  
 7   mortgage            5000 non-null   int64  
 8   personal_loan       5000 non-null   int64  
 9   securities_account  5000 non-null   int64  
 10  cd_account          5000 non-null   int64  
 11  online              5000 non-null   int64  
 12  creditcard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 507.9 KB

The data dictionary is structured correctly as desired for now.

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Treatment of Zipcode as a category feature

In [ ]:
# Find out how many zipcodes are in the data
len(loans['zipcode'].unique())
Out[ ]:
467

There are 467 unique zipcodes which is too many as a category. We will try to group zipcodes into regions using the uszipcode library. The steps to do this are below:

In [ ]:
# Create a City column
loans['city'] = loans['zipcode'].apply(lambda x: search.by_zipcode(x).city if search.by_zipcode(x) is not None else None).astype('category')
In [ ]:
# Examine the unique cities
unique_cities = loans['city'].unique()
print('List of the ',len(loans['city'].unique()), 'unique counties based on zipcode:\n')
print(unique_cities)
List of the  245 unique counties based on zipcode:

['Pasadena', 'Los Angeles', 'Berkeley', 'San Francisco', 'Northridge', ..., 'San Dimas', 'Signal Hill', 'Tahoe City', 'Weed', 'Stinson Beach']
Length: 245
Categories (244, object): ['Agoura Hills', 'Alameda', 'Alamo', 'Albany', ..., 'Whittier',
                           'Woodland Hills', 'Yorba Linda', 'Yucaipa']
In [ ]:
# Create a County column
loans['county'] = loans['zipcode'].astype('category').apply(lambda x: search.by_zipcode(x).county if search.by_zipcode(x) is not None else None).astype('category')
In [ ]:
# Examine the unique counties
unique_cities = loans['county'].unique()
print('List of the ',len(loans['county'].unique()), 'unique counties based on zipcode:\n')
print(unique_cities)
List of the  39 unique counties based on zipcode:

['Los Angeles County', 'Alameda County', 'San Francisco County', 'San Diego County', 'Monterey County', ..., 'Stanislaus County', 'Shasta County', 'Tuolumne County', 'Napa County', 'Lake County']
Length: 39
Categories (38, object): ['Alameda County', 'Butte County', 'Contra Costa County', 'El Dorado County',
                          ..., 'Trinity County', 'Tuolumne County', 'Ventura County',
                          'Yolo County']
In [ ]:
# Using the California Censue website (https://census.ca.gov/regions/) we can find the regions for each county
# and map each zip code to its appropriate region by using the county column
county_to_region = {
    'Los Angeles County': 'Southern California',
    'Alameda County': 'San Francisco Bay Area',
    'San Francisco County': 'San Francisco Bay Area',
    'San Diego County': 'Southern California',
    'Monterey County': 'Central Coast',
    'Ventura County': 'Southern California',
    'Santa Barbara County': 'Central Coast',
    'Marin County': 'San Francisco Bay Area',
    'Santa Clara County': 'San Francisco Bay Area',
    'Santa Cruz County': 'Central Coast',
    'San Mateo County': 'San Francisco Bay Area',
    'Humboldt County': 'North Coast',
    'Contra Costa County': 'San Francisco Bay Area',
    'Orange County': 'Southern California',
    'Sacramento County': 'Sacramento Valley',
    'Yolo County': 'Sacramento Valley',
    'Placer County': 'Sacramento Valley',
    'San Bernardino County': 'Southern California',
    'San Luis Obispo County': 'Central Coast',
    'Riverside County': 'Southern California',
    'Kern County': 'Central Valley',
    'Fresno County': 'Central Valley',
    'Sonoma County': 'Wine Country',
    'El Dorado County': 'Sierra Nevada',
    'San Benito County': 'Central Coast',
    'Butte County': 'Central Valley',
    'Solano County': 'San Francisco Bay Area',
    'Mendocino County': 'North Coast',
    'San Joaquin County': 'Central Valley',
    'Imperial County': 'Southern California',
    'Siskiyou County': 'Far Northern California',
    'Merced County': 'Central Valley',
    'Trinity County': 'Far Northern California',
    'Stanislaus County': 'Central Valley',
    'Shasta County': 'Far Northern California',
    'Tuolumne County': 'Sierra Nevada',
    'Napa County': 'Wine Country',
    'Lake County': 'Wine Country'
}

# Assign each County to the appropriate Region
loans['region'] = loans['county'].map(county_to_region).astype('category')
In [ ]:
unique_regions = loans['region'].unique()
print('List of the ',len(loans['region'].unique()), 'unique regions based on zipcode:\n')
print(unique_regions)
List of the  10 unique regions based on zipcode:

['Southern California', 'San Francisco Bay Area', 'Central Coast', 'North Coast', 'Sacramento Valley', 'Central Valley', NaN, 'Wine Country', 'Sierra Nevada', 'Far Northern California']
Categories (9, object): ['Central Coast', 'Central Valley', 'Far Northern California', 'North Coast',
                         ..., 'San Francisco Bay Area', 'Sierra Nevada',
                         'Southern California', 'Wine Country']
In [ ]:
# Check for orphaned zipcodes in the regions
orphan_zipcode = loans[loans['region'].isnull()]['zipcode'].unique()
print("Zipcodes that have no Region: ", orphan_zipcode)
Zipcodes that have no Region:  [92717 93077 92634 96651]

There are 4 zipcodes that are aren't associated with a Region. Let's find out what's going on with them.

After researching the internet, we can see these zip codes (92717, 93077,92634) belong to Cities (Irvin or Simi Valley), Counties (Orange or Ventura), and Regions Southern California. We will update the correct values for these rows in the next code. I cannot determine a reliable location of zipcode 96651, since it only has 6 observations, we will drop those rows.

In [ ]:
# Count the observations with zipcode 96651 then delete the rows for that invalid zipcode
count = loans[loans['zipcode'] == 96651].shape[0]
print("Number of rows with zip code 96651:", count)
loans = loans[loans['zipcode'] != 96651]
Number of rows with zip code 96651: 6
In [ ]:
# Impute the correct city, county and region for these zipcodes
loans.loc[loans['zipcode'] == 92717, ['city', 'county', 'region']] = ['Irvine', 'Orange County', 'Southern California']
loans.loc[loans['zipcode'] == 93077, ['city', 'county', 'region']] = ['Simi Valley', 'Ventura County', 'Southern California']
loans.loc[loans['zipcode'] == 92634, ['city', 'county', 'region']] = ['Lake Forest', 'Orange County', 'Southern California']
In [ ]:
# Check for orphaned zipcodes in the regions
orphan_zipcode = loans[loans['region'].isnull()]['zipcode'].unique()
print("Zipcodes that have no Region: ", orphan_zipcode)
Zipcodes that have no Region:  []

We don't have any orphaned zipcodes now

In [ ]:
loans.shape
Out[ ]:
(4994, 16)

*Now that we've sucessfully mapped zipcodes to a region, we can drop the city and county columns as they are no longer of useful to us.

In [ ]:
# Dropping the city and county columnns
loans = loans.drop(["city", "county"], axis=1)
# loans = loans.drop(["city", "county"], axis=1)
In [ ]:
# Check for missing values
missing_values = loans.isnull().sum()
print(missing_values)

# Check for duplicates
duplicate_rows = loans.duplicated().sum()
print("\nNumber of duplicate rows:", duplicate_rows)
age                   0
experience            0
income                0
zipcode               0
family                0
ccavg                 0
education             0
mortgage              0
personal_loan         0
securities_account    0
cd_account            0
online                0
creditcard            0
region                0
dtype: int64

Number of duplicate rows: 0

Double Check for missing values and duplicate rows. All good.

In [ ]:
loans.sample(10)
Out[ ]:
age experience income zipcode family ccavg education mortgage personal_loan securities_account cd_account online creditcard region
786 45 21 42 94305 2 2.5 1 0 0 1 0 1 0 San Francisco Bay Area
4438 43 18 22 90025 2 0.0 3 0 0 0 0 0 0 Southern California
1206 63 37 165 95035 4 5.1 3 0 1 0 0 0 0 San Francisco Bay Area
1900 61 36 10 91365 4 0.4 2 0 0 0 0 1 0 Southern California
1494 59 35 60 90089 1 0.0 2 0 0 0 0 1 1 Southern California
4718 32 6 35 91107 3 1.0 1 0 0 1 0 1 0 Southern California
2242 41 17 45 93437 1 1.8 1 172 0 1 0 1 0 Central Coast
1181 25 0 65 90095 4 0.2 1 0 0 1 0 0 0 Southern California
876 40 14 58 94025 2 2.8 1 0 0 0 0 1 0 San Francisco Bay Area
1018 39 15 61 90018 2 0.6 3 127 0 0 0 0 0 Southern California
In [ ]:
loans.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4994 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   age                 4994 non-null   int64   
 1   experience          4994 non-null   int64   
 2   income              4994 non-null   int64   
 3   zipcode             4994 non-null   int64   
 4   family              4994 non-null   int64   
 5   ccavg               4994 non-null   float64 
 6   education           4994 non-null   int64   
 7   mortgage            4994 non-null   int64   
 8   personal_loan       4994 non-null   int64   
 9   securities_account  4994 non-null   int64   
 10  cd_account          4994 non-null   int64   
 11  online              4994 non-null   int64   
 12  creditcard          4994 non-null   int64   
 13  region              4994 non-null   category
dtypes: category(1), float64(1), int64(12)
memory usage: 551.5 KB

The sample data and data dictionary are looking as expected. We are ready to begin EDA.

EDA Section¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [ ]:
def correlation_heatmap(df):
    corr = df.corr()

    plt.figure(figsize=(10,8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title('Correlation Heatmap', fontsize=16)
    plt.show()
correlation_heatmap(loans)

Key observations:

  1. age and experience: These two features have a very high positive correlation of 0.99, almost 1. This means that as the age of a customer increases, their professional experience also tends to increase, which makes sense intuitively.

  2. income and ccavg: There is a strong positive correlation of 0.65, indicating that customers with higher income also tend to spend more on their credit cards.

  3. income and personal_loan: There is a moderate positive correlation of 0.50. This implies that customers with higher income are more likely to have accepted a personal loan.

  4. ccavg and personal_loan: Similarly, a moderate positive correlation (0.37) suggests that customers who spend more on credit cards per month are more likely to accept a personal loan.

  5. cd_account and personal_loan: There is a significant positive correlation (0.32) between having a certificate of deposit account and accepting a personal loan.

  6. cd_account and securities_account: There is a significant positive correlation (0.32), implying that the customers who have a securities account are also likely to have a CD account with the bank.

  7. education and personal_loan: The correlation is moderately positive (0.14), indicating that higher the education level, more likely the customer is to accept a personal loan.

Observations for future modeling:

  • Variables 'age' and 'experience' are highly correlated, this can lead to multicollinearity if used together in a model. It would be better to drop one of these features during model training to avoid the multicollinearity issue. We will be dropping 'experience'. This also solves the issue of negative values in the experience column.

  • 'income', 'ccavg', and 'cd_account' are relatively more correlated with 'personal_loan' than other features. These might be good predictors for the likelihood of a customer accepting a personal loan.

In [ ]:
# Dropping 'experience' based upon an almost 1-1 correlation with 'age'.  We will keep age and use it in the model.
loans.drop('experience', axis=1, inplace=True)
In [ ]:
def univariate_analysis(df, column):
    fig = plt.figure(figsize=(10, 6)) # Increase the figure size

    # Create grid for plots and summary
    gs = fig.add_gridspec(3, 2, height_ratios=[1, 1, 0.5])  # Adjust height ratios here

    # Histogram
    ax1 = fig.add_subplot(gs[0, 0])
    sns.histplot(df[column], kde=True, ax=ax1, color='skyblue', edgecolor='black')
    ax1.set_title('Histogram')
    ax1.grid(axis='y', linestyle='--', alpha=0.5)

    # Boxplot
    ax2 = fig.add_subplot(gs[0, 1])
    sns.boxplot(x=df[column], ax=ax2, color='lightgreen')
    ax2.set_title('Boxplot')
    ax2.grid(axis='y', linestyle='--', alpha=0.5)

    # Violin plot
    ax3 = fig.add_subplot(gs[1, 0])
    sns.violinplot(x=df[column], ax=ax3, color='plum')
    ax3.set_title('Violin Plot')
    ax3.grid(axis='y', linestyle='--', alpha=0.5)

    # KDE plot
    ax4 = fig.add_subplot(gs[1, 1])
    sns.kdeplot(x=df[column], ax=ax4, color='gold', shade=True)
    ax4.set_title('KDE Plot')
    ax4.grid(axis='y', linestyle='--', alpha=0.5)

    fig.suptitle(f'Univariate Analysis of {column}', fontsize=16)

    # Summary table
    summary_table = df[column].describe().to_frame().transpose()
    summary_table['IQR'] = summary_table['75%'] - summary_table['25%']
    summary_table['IQR+-1.5'] = summary_table['IQR'] * 1.5
    summary_table['IQR+-1.5_lower'] = summary_table['25%'] - summary_table['IQR+-1.5']
    summary_table['IQR+-1.5_upper'] = summary_table['75%'] + summary_table['IQR+-1.5']
    summary_table = summary_table[['min', '25%', '50%', '75%', 'max', 'IQR+-1.5_lower', 'IQR+-1.5_upper']]

    ax5 = fig.add_subplot(gs[2, :]) # This is where the summary table will go
    ax5.set_title(f'Summary of {column.title()}', fontsize=11, pad=8) # Reduced pad for smaller space
    ax5.axis('off')
    ax5.table(cellText=summary_table.values,
              colLabels=summary_table.columns,
              cellLoc='center',
              loc='center')

    plt.tight_layout()
    plt.show()
In [ ]:
univariate_analysis(loans, 'age')

Observations:

  • Age is evenly distributed and there are no outliers.
  • Minimum Age is 23 years and Maximum is 67 years
  • Mean and Median Age are around 45 years
  • There are 5 spikes in the data for age.
In [ ]:
univariate_analysis(loans, 'income')

Observations:

  • Income is right skewed with outliers starting at about 186.5K.
  • Range of Income is from 8k - 224k
In [ ]:
univariate_analysis(loans, 'ccavg')

Observations:

  • Credit Card Avg is right skewed with outliers starting above 5.2
  • Range of CC Avg varies from 0k - 10k
  • Max CCAvg is much higher than Q3. We'll investigate this.
In [ ]:
univariate_analysis(loans, 'mortgage')

Observations:

  • Mortgage is right skewed with outliers beginning at 252.5.
  • Range of Mortgage varies from 0k - 635k
  • Max Mortgage is very much higher than Q3. We'll invesigate to treat the outliers.
In [ ]:
def plot_features(df, features):
    num_features = len(features)
    num_cols = 2
    num_rows = (num_features + 1) // num_cols

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 20))

    for i, feature in enumerate(features):
        row = i // num_cols
        col = i % num_cols
        ax = axes[row, col]

        # Create a count plot
        sns.countplot(data=df, x=feature, palette='pastel', ax=ax)

        # Customize the plot
        ax.set_xlabel(features[i])
        ax.set_ylabel('Count')
        ax.set_title(f'Distribution of {feature}')

        # Rotate x-axis labels
        ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')

        # Add percentage labels to the top of each bar
        total = len(df[feature])
        for p in ax.patches:
            height = p.get_height()
            percentage = f'{(height/total)*100:.1f}%'
            ax.annotate(percentage, (p.get_x() + p.get_width() / 2, height), ha='center', va='bottom')

    # Remove empty subplots if the number of features is not a multiple of 2
    if num_features % 2 != 0:
        fig.delaxes(axes.flatten()[-1])

    plt.tight_layout()
    plt.show()
In [ ]:
feature_list= ['personal_loan', 'education', 'family', 'securities_account', 'cd_account', 'online', 'creditcard', 'region']
plot_features(loans, feature_list)

Observations:

  • 9.6% customers have secured personal loans from AllLife Bank. Our goal is too increase this percentage.
  • 42% customers are Undergraduates, 28% Graduates, 30% Post Grads
  • Family size is fairly evenly distributed between the group sizes.
  • 89.6% customers do not have a Secuties Account with AllLife, 10.4% have one.
  • 94% customers do not have CDs with AllLife
  • 59.7% customers have Online banking enabled
  • 70.6% customers do not use Credit Cards from other banks
  • 46% of customers live in Southern Cal, 34% in San Fran area, 20% from other regions

Treatment of Outliers¶

In [ ]:
def plot_boxplots(features):
    num_features = len(features)
    fig, axs = plt.subplots(1, num_features, figsize=(4*num_features, 4))

    for i, feature in enumerate(features):
        axs[i].boxplot(loans[feature], vert=False)
        axs[i].set_title(feature.capitalize())
        axs[i].set_xlabel('Value')

    plt.tight_layout()
    plt.show()
In [ ]:
features = ['income', 'ccavg', 'mortgage']
plot_boxplots(features)

Visulize the outliers of each feature prior to treatment of the outliers.

In [ ]:
# Function to treat the outliers by setting lower and upper wiskers as the boundries for all outliers
def treat_outliers(data, features):
    for feature in features:
        # Calculate Q1, Q3, and IQR
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1

        # Calculate Lower_Whisker and Upper_Whisker
        Lower_Whisker = Q1 - 1.5 * IQR
        Upper_Whisker = Q3 + 1.5 * IQR

        # Replace outliers with whisker values
        data.loc[data[feature] < Lower_Whisker, feature] = Lower_Whisker
        data.loc[data[feature] > Upper_Whisker, feature] = Upper_Whisker

# Apply outlier treatment to the 'income', 'ccavg', and 'mortgage' features in the 'loans' DataFrame
features_to_treat = ['income', 'ccavg', 'mortgage']
treat_outliers(loans, features_to_treat)
In [ ]:
features = ['income', 'ccavg', 'mortgage']
plot_boxplots(features)

Above boxplots shows that the 3 features' outliers have been treated successfully.

In [ ]:
# Function to plot the categorical features
def plot_categorical_data(df, cat_features):
    plt.figure(figsize=(16, 12))
    num_plots = len(cat_features)
    for i, col in enumerate(cat_features):
        plt.subplot(3, 2, i + 1)
        # Count the number of occurrences of each category
        counts = df[col].value_counts(ascending=False)
        # Sort the counts and percentages according to the index of counts
        counts.sort_index(inplace=True)
        # Calculate the percentage for each category
        percentages = 100 * counts / len(df)
        percentages.sort_index(inplace=True)
        # Prepare labels - now only includes percentages
        labels = ['{:.1f}%'.format(p) for p in percentages]
        # Create the bar plot
        ax = sns.barplot(x=counts.index, y=counts, palette='pastel')
        # Add the labels to the bars
        for i, p in enumerate(ax.patches):
            ax.annotate(labels[i], (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='bottom', color='black', fontsize='medium')
        # Remove vertical grid
        ax.grid(axis='y', color='gray', linestyle='--', linewidth=0.5)
        # Customizing x-axis and y-axis labels
        plt.xticks(rotation='vertical')
        plt.yticks(rotation='vertical')
        plt.tight_layout()
        # Adding title
        plt.title(f'Distribution of {col.capitalize()} (in %)\n', fontsize=13)
    plt.show()
In [ ]:
# Use binning to group Age, Income, and CCAvg into discernable groups for EDA
age_bins = [0, 35, 55, 100]
age_labels = ['Young Adults', 'Middle age', 'Senior']

income_bins = [0, 39, 99, 1000]
income_labels = ['Low', 'Mid', 'High']

ccavg_bins = [0, 0.7, 2.5, 1000]
ccavg_labels = ['Low', 'Mid', 'High']

# Bin the variables using cut()
loans['age_Group'] = pd.cut(loans['age'], bins=age_bins, labels=age_labels)
loans['income_Group'] = pd.cut(loans['income'], bins=income_bins, labels=income_labels)
loans['ccavg_Group'] = pd.cut(loans['ccavg'], bins=ccavg_bins, labels=ccavg_labels)
In [ ]:
cat_features = ['income_Group', 'ccavg_Group', 'age_Group']
plot_categorical_data(loans, cat_features)

Observations:

  • 49% customers are Mid level earners, 25% are higher earners, 26% low earners
  • 47.4% customers spends 700 - 2,500 on their credit card per month
  • 50.2% customers are Middle Aged (35-55)

Bivariate Analysis¶

In [ ]:
# Removing Zipcode column and relying on Region instead
loans = loans.drop('zipcode', axis=1)
In [ ]:
def evaluate_feature(df, feature):
    plt.figure(figsize=(12, 6))

    # Plotting Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data=df, x=feature, hue='personal_loan', multiple='stack', kde=True, palette='pastel')
    plt.title('Histogram')

    # Plotting Boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(data=df, x='personal_loan', y=feature, palette='pastel')
    plt.title('Boxplot')

    plt.tight_layout()
    plt.show()
In [ ]:
# Selecting the feature columns
feature_columns = ['age', 'income', 'family', 'ccavg', 'education', 'mortgage', 'personal_loan']
subset_df = loans[feature_columns]
sns.pairplot(subset_df, hue='personal_loan', palette='pastel')
plt.show()

Observations: The pair plot of Personal Loans with the other features show the non-linear nature of most of the data.

  • Customers of all ages take out loans. Older clients have more income, more expensive mortgages and use credit card more.
  • Customer with higher credit card usage take out more loans, have higher income, and larger families.
  • Customers with larger families tend to have higher income and take out more loans.
  • More educated customers have higher income and CC avg.
  • Higher income customers use their credit cards more.
  • Customers who have high income, larger families, higher credit card usage, of all ages are the most active loan seekers.
In [ ]:
evaluate_feature(loans, 'age')
  • Customers of any age from 23 to 70 should be considered as targets for loans
In [ ]:
evaluate_feature(loans, 'income')
  • Customers earning 100k or more should be coonsdiered targets for loans
In [ ]:
evaluate_feature(loans, 'family')
  • Customers with families of any size should be considered targets for loans
In [ ]:
evaluate_feature(loans,'ccavg')
  • Customers with any credit card usage and especially exceptionally high usage should be considered as targets for loans
In [ ]:
evaluate_feature(loans,'education')
  • Customers with all education levels should be considered for loans with an emphasis on graduate level and above.
In [ ]:
evaluate_feature(loans,'mortgage')
  • Customers without mortgages should be heavily targeted for loans as well as customers with mortages over 250k
In [ ]:
evaluate_feature(loans,'securities_account')

*Customers without a Securities Account should be prioritized in loan marketing efforts

In [ ]:
evaluate_feature(loans,'cd_account')
  • Customers with and without a CD account should be targeted for loans
In [ ]:
evaluate_feature(loans,'online')
  • Customers with and without online banking accounts should be targeted for loans.
In [ ]:
evaluate_feature(loans,'creditcard')
  • Customers that use credit cards issued by different banks should be priortized followed by customers that use AllLife Bank cards.
In [ ]:
# Create a plot to see which regions are getting the most loans
plt.figure(figsize=(14, 8))
order = loans['region'].value_counts(ascending=False).index  # Data order
ax = sns.countplot(data=loans, x='region', hue='personal_loan', palette='husl', order=order)

for p in ax.patches:
    percentage = '{:.1f}%\n({})'.format(100 * p.get_height() / len(loans['personal_loan']), p.get_height())
    # Added percentage and actual value
    x = p.get_x() + p.get_width() / 2
    y = p.get_y() + p.get_height() + 40
    plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium')  # Annotation on top of bars
    plt.xticks(color='black', fontsize='medium', rotation=-90)

plt.title('Personal Loan Distribution by Region\n0: No Loan, 1: Loan', color='black')
plt.show()
  • Most loans are from the Southern Cal and San Fran Bay areas, followed by Central Coast and Sacramento Valley. Marketing efforts should be focused on customers primarliy from these areas.
  • Not unexpectedly, these areas have the highest population of customers.

Model Building¶

Model Evaluation Criterion¶

False Positives: Predicting that a person will buy a loan when they actually do not (false positive) can result in a loss of resources for the bank. This includes the cost of marketing efforts, time spent on processing loan applications, and other resources allocated to potential customers who do not end up accepting the loan. Minimizing false positives can help optimize resource allocation and reduce unnecessary costs.

False Negatives: Predicting that a person will not buy a loan when they actually do (false negative) can lead to a loss of opportunity for the bank. It means the bank fails to offer a loan to a potential customer who would have been interested in accepting it. This missed opportunity can impact the bank's revenue and business growth. Minimizing false negatives is important to ensure that the bank identifies and targets all potential customers who are likely to accept a loan.

The relative importance of these cases may vary based on the bank's specific goals, priorities, and cost structures. It is crucial to consider the potential financial impact, customer satisfaction, and business objectives when determining which case is more important for the bank. This consideration will help guide the evaluation metrics and model optimization efforts to focus on minimizing the most significant type of error for the bank's specific context.

For this particular model evaluation, we will use Precision, Recall and F1 Scores as our primary evaluation criteriion. These metrics provide insights into the model's ability to correctly identify loan buyers, minimize false positives, and minimize false negatives. By utilizing these metrics, the model's performance can be assessed and optimized to align with the bank's objectives and priorities.

Model Building: Logistic Regression¶

In [ ]:
# Dropping features that are irrevelent to our model
loans = loans.drop(['age_Group', 'income_Group', 'ccavg_Group'], axis=1)

Logistic Regression - Model1 (LR1)¶

In [ ]:
# Perform one-hot encoding for the 'region' feature
loans_encoded = pd.get_dummies(loans, columns=['region'], drop_first=True)

# Split the data into features (X) and target variable (y)
X = loans_encoded.drop('personal_loan', axis=1)
y = loans_encoded['personal_loan']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print classification report with recall, precision, and F1-score
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(classification_report(y_test, y_pred))
classification_report_output1 = buffer.getvalue()
print(classification_report(y_test, y_pred))

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Calculate percentages for the confusion matrix
cm_percent = cm / cm.sum() * 100

# Prepare new annotations
group_counts = ["{0:0.0f}\n".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2f}%".format(value) for value in cm_percent.flatten()]
labels = [f"{v1}{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Plot the heatmap with both values and percentages
plt.figure(figsize=(8, 6))
sns.heatmap(cm_percent, annot=labels, fmt='', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
              precision    recall  f1-score   support

           0       0.96      0.99      0.97       891
           1       0.92      0.63      0.75       108

    accuracy                           0.95       999
   macro avg       0.94      0.81      0.86       999
weighted avg       0.95      0.95      0.95       999

In [ ]:
# Calculate probabilities for the positive class
y_pred_proba = model.predict_proba(X_test)[:,1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate AUC
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
lw = 2  # line width
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')  # diagonal line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
  • This model has an ROC curve with an area of 0.96 which indicates that this model has excellent discrimination ability.

  • It suggests that the model is effective in distinguishing between the positive and negative classes, with a high true positive rate and a low false positive rate. The closer the AUC is to 1, the better the model's performance in terms of correctly classifying instances.

  • Therefore, an ROC curve area of 0.96 suggests that this model has a strong predictive power and performs well in distinguishing loan buyers from non-buyers, with a high degree of accuracy.

Model Performance Improvement: Logistic Regression - Model2 (LR2)¶

In [ ]:
# Create a Logistic Regression model with balanced class weight and newton-cg solver
model = LogisticRegression(solver='newton-cg',random_state=1, class_weight={0:0.09,1:0.91})
model.fit(X_train, y_train)

# Set the threshold
threshold = 0.6  # Adjust this threshold as desired

# Make predictions on the test set based on the threshold
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Predict probabilities for the positive class
y_pred = (y_pred_prob >= threshold).astype(int)  # Convert probabilities into class labels based on the threshold+-

# Print classification report with recall, precision, and F1-score
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(classification_report(y_test, y_pred))
classification_report_output2 = buffer.getvalue()
print(classification_report(y_test, y_pred))

# Calculate and plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm_percent = cm / cm.sum() * 100

# Prepare new annotations
group_counts = ["{0:0.0f}\n".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2f}%".format(value) for value in cm_percent.flatten()]
labels = [f"{v1}{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Plot the heatmap with both values and percentages
plt.figure(figsize=(8, 6))
sns.heatmap(cm_percent, annot=labels, fmt='', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Calculate and plot the ROC curve
y_pred_proba = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
              precision    recall  f1-score   support

           0       0.98      0.92      0.95       891
           1       0.55      0.84      0.67       108

    accuracy                           0.91       999
   macro avg       0.77      0.88      0.81       999
weighted avg       0.93      0.91      0.92       999

Model Building: Decision Tree - Model3 (DT1)¶

Visualize the Descision Tree¶

In [ ]:
# Build a Descision Tree model
loans_encoded = pd.get_dummies(loans, columns=['region'], drop_first=True)

# Split the data into features (X) and target variable (y)
X = loans_encoded.drop('personal_loan', axis=1)
y = loans_encoded['personal_loan']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree model
model = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print classification report with recall, precision, and F1-score
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(classification_report(y_test, y_pred))
classification_report_output3 = buffer.getvalue()

# Print the stored output
print(classification_report(y_test, y_pred))

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Calculate percentages for the confusion matrix
cm_percent = cm / cm.sum() * 100

# Prepare new annotations
group_counts = ["{0:0.0f}\n".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2f}%".format(value) for value in cm_percent.flatten()]
labels = [f"{v1}{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Plot the heatmap with both values and percentages
plt.figure(figsize=(8, 6))
sns.heatmap(cm_percent, annot=labels, fmt='', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       891
           1       0.94      0.92      0.93       108

    accuracy                           0.98       999
   macro avg       0.97      0.95      0.96       999
weighted avg       0.98      0.98      0.98       999

Model Performance Improvement: Decision Tree¶

In [ ]:
# Create an instance of the DecisionTreeClassifier
best_dt = DecisionTreeClassifier()

# A routine to show the decison tree's descision rules and display the feature's importance by text and by plot
best_dt.fit(X_train, y_train)

# Specify class names
class_names = ['No Personal Loan', 'Has Personal Loan']

# Specify feature names
feature_names = X_train.columns

# Plot the decision tree
plt.figure(figsize=(20,10))  # set plot size (denoted in inches)
plot_tree(best_dt, filled=True, feature_names=feature_names, class_names=class_names, rounded=True)
plt.show()

# Get feature importances
importances = best_dt.feature_importances_

# Create a DataFrame to display features and their importances
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
})

# Sort the DataFrame by importance in descending order
feature_importances = feature_importances.sort_values('Importance', ascending=False)

print("Feature Importances:")
print(feature_importances)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.title('Feature Importance')
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.gca().invert_yaxis()  # invert the y-axis to display the most important feature at the top
plt.show()
Feature Importances:
                           Feature  Importance
4                        education    0.394081
1                           income    0.310038
2                           family    0.157860
3                            ccavg    0.065099
0                              age    0.029849
7                       cd_account    0.014951
8                           online    0.010651
5                         mortgage    0.005779
9                       creditcard    0.004790
6               securities_account    0.004723
17             region_Wine Country    0.002178
10           region_Central Valley    0.000000
11  region_Far Northern California    0.000000
12              region_North Coast    0.000000
13        region_Sacramento Valley    0.000000
14   region_San Francisco Bay Area    0.000000
15            region_Sierra Nevada    0.000000
16      region_Southern California    0.000000

Here we visualize the decision rules via the decision tree, show the feature's importance ranked descending, and plot the important features.

Our first decision tree model produces a tree with many nodes and goes quite deep. We will try pruning the tree to see if we can improve the model.

Descision Tree Improvement - Model4 (DT2)¶

In [ ]:
# We are using gridsearch to find the best hyperparameters to improve the model by pruning
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1, 10),
    'min_samples_split': range(1, 10),
    'min_samples_leaf': range(1, 5),
}

# Initialize a DecisionTreeClassifier
dt = DecisionTreeClassifier()

# Initialize a GridSearchCV object with 5-fold cross validation
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')

# Fit the model to the training data
grid_search.fit(X_train, y_train)
Out[ ]:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 10),
                         'min_samples_leaf': range(1, 5),
                         'min_samples_split': range(1, 10)},
             scoring='accuracy')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 10),
                         'min_samples_leaf': range(1, 5),
                         'min_samples_split': range(1, 10)},
             scoring='accuracy')
DecisionTreeClassifier()
DecisionTreeClassifier()
In [ ]:
# Assigning the best parameters to a variable
best_params = grid_search.best_params_
print("The best hyperparameters based upon gridsearch are: ", best_params, "\n")

# Initialize a new decision tree with the optimal parameters
best_dt = DecisionTreeClassifier(**best_params)

# Fit the model and make predictions
best_dt.fit(X_train, y_train)
y_pred = best_dt.predict(X_test)

# Print the accuracy of the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

# Print classification report
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(classification_report(y_test, y_pred))
classification_report_output4 = buffer.getvalue()
print(classification_report(y_test, y_pred))
The best hyperparameters based upon gridsearch are:  {'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 5} 

Accuracy: 0.985985985985986
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       891
           1       0.98      0.89      0.93       108

    accuracy                           0.99       999
   macro avg       0.98      0.94      0.96       999
weighted avg       0.99      0.99      0.99       999

In [ ]:
# Specify class names
class_names = ['No Personal Loan', 'Has Personal Loan']

# Specify feature names
feature_names = X_train.columns

# Plot the decision tree
plt.figure(figsize=(20,10))  # set plot size (denoted in inches)
plot_tree(best_dt, filled=True, feature_names=feature_names, class_names=class_names, rounded=True)
plt.show()

# Get feature importances
importances = best_dt.feature_importances_

# Create a DataFrame to display features and their importances
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
})

# Sort the DataFrame by importance in descending order
feature_importances = feature_importances.sort_values('Importance', ascending=False)

print("Feature Importances:")
print(feature_importances)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.title('Feature Importance')
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.gca().invert_yaxis()  # invert the y-axis to display the most important feature at the top
plt.show()
Feature Importances:
                           Feature  Importance
1                           income    0.600540
4                        education    0.182512
2                           family    0.119927
3                            ccavg    0.084201
7                       cd_account    0.011809
5                         mortgage    0.001011
0                              age    0.000000
12              region_North Coast    0.000000
16      region_Southern California    0.000000
15            region_Sierra Nevada    0.000000
14   region_San Francisco Bay Area    0.000000
13        region_Sacramento Valley    0.000000
9                       creditcard    0.000000
11  region_Far Northern California    0.000000
10           region_Central Valley    0.000000
8                           online    0.000000
6               securities_account    0.000000
17             region_Wine Country    0.000000

Our 2nd Decision Tree model is visualized showing a max depth of 4 nodes and a shift in feature importance. Let's compare the metrics of the various models in the next section.

Model Comparison and Final Model Selection¶

In [ ]:
# Function to parse classification report strings
def parse_report(report_str):
    lines = report_str.split('\n')
    lines = [line for line in lines if line.strip() != '']

    classes = []
    precision = []
    recall = []
    f1_score = []
    support = []

    for line in lines[1:-3]:  # Skip the first line (header row) and last 3 lines (averages and totals)
        parts = re.split(r'\s+', line)
        classes.append(parts[1])
        precision.append(float(parts[2]))
        recall.append(float(parts[3]))
        f1_score.append(float(parts[4]))
        support.append(int(parts[5]))

    # Create a DataFrame from the parsed data
    df = pd.DataFrame({
        'class': classes,
        'precision': precision,
        'recall': recall,
        'f1-score': f1_score,
        'support': support
    })
    return df

# Parse each report
report_df1 = parse_report(classification_report_output1)
report_df2 = parse_report(classification_report_output2)
report_df3 = parse_report(classification_report_output3)
report_df4 = parse_report(classification_report_output4)

# Add a model column to each DataFrame
report_df1['model'] = 'Model1 (LR1)'
report_df2['model'] = 'Model2 (LR2)'
report_df3['model'] = 'Model3 (DT1)'
report_df4['model'] = 'Model4 (DT2)'

# Concatenate the DataFrames
all_reports_df = pd.concat([report_df1, report_df2, report_df3, report_df4])

# Pivot the DataFrame for easier comparison
pivot_df = all_reports_df.pivot(index='model', columns='class')

#Display the data in a grid style format
pivot_df.style
Out[ ]:
  precision recall f1-score support
class 0 1 0 1 0 1 0 1
model                
Model1 (LR1) 0.960000 0.920000 0.990000 0.630000 0.970000 0.750000 891 108
Model2 (LR2) 0.980000 0.550000 0.920000 0.840000 0.950000 0.670000 891 108
Model3 (DT1) 0.990000 0.940000 0.990000 0.920000 0.990000 0.930000 891 108
Model4 (DT2) 0.990000 0.980000 1.000000 0.890000 0.990000 0.930000 891 108

The table is providing the precision, recall, and F1-score for each loan class (0 and 1) under different models. The support shows the number of instances of each class in the dataset. Here's a brief explanation of each metric:

Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positives. High precision relates to the low false positive rate.

Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to the all observations in actual class. High recall indicates a model that is good at detecting positives.

F1 score: The F1 Score is the weighted average of Precision and Recall. It tries to find the balance between precision and recall.


Model 1 (LR1): The model has high precision for both classes, implying that when it predicts a customer will or will not accept a loan, it is often correct. However, the recall for class 1 is quite low (0.63), which means the model is missing a significant portion of the customers who did accept the loan offer. Therefore, while Model 1 might avoid giving loans to those who won't accept them, it's also likely to miss out on offering loans to a good chunk of customers who would accept.

Model 2 (LR2): This model shows a higher precision for class 0 but lower for class 1 compared to Model 1. The recall for class 1 is higher in this case, suggesting that the model is better at identifying customers who would accept the loan offer, but it is likely to misclassify more customers who would not accept the loan. As a result, the bank might end up targeting some customers who are not interested, wasting marketing resources.

Model 3 (DT1): This model performs exceptionally well across all metrics, with high precision and recall for both classes. This means the model is very good at predicting who will and will not accept a loan offer. This would be very helpful for the bank in effectively targeting customers for loan offers and avoiding wasted resources.

Model 4 (DT2): This model also has very high precision and recall across both classes, indicating that it is excellent at identifying potential loan customers as well as those who are unlikely to accept a loan. It performs slightly better than Model 3 for class 1, making it the best of the four models in terms of balanced performance.

In conclusion, both Models 3 (DT1) and 4 (DT2) are very effective for this task. However, Model 4 (DT2) is the best model to use to maximize the success of the bank's marketing campaign by accurately predicting customers who are likely to accept the loan offers and minimizing wasted marketing resources on those who are unlikely to accept.

Actionable Insights and Business Recommendations¶

  • Tailored Marketing: Implement targeted marketing strategies for each customer group. High profile customers can be reached out through more personal and exclusive channels, such as dedicated relationship managers or premium customer service. For average and low-profile customers, strategies could involve regular follow-ups, email marketing or mobile notifications, stressing on the ease and benefits of availing personal loans.

  • Special Packages: Consider offering pre-approved loans or personalized loan packages to high and average profile customers. Special packages or rates could be offered to customers who maintain a high average balance, have a long-term relationship with the bank, or have multiple accounts/products with the bank.

  • Education Level Programs: The models show that education level is a significant factor in predicting loan uptake. The bank could partner with universities or offer special student loan packages to attract younger clients, who could potentially become long-term customers.

  • Cross-Selling Opportunities: Utilize the information on family size and credit card expenditure to cross-sell other banking products. Larger families might be interested in special savings accounts, insurance products or educational loans. Customers with high credit card expenditure might be good targets for premium credit card offers or investment products.

  • Financial Literacy Programs: For the lower-income or less-educated customers, holding financial literacy programs could help them understand the benefits of different banking products, including personal loans. This can lead to an increased uptake in these products over time.

  • Improving Prediction Models: Continue refining and enhancing the prediction models to ensure they stay accurate and effective. This could include collecting more data over time, adding more features, or exploring other machine learning algorithms. Regular evaluation and update of models are key to maintain their predictive power.